Common and Individual Features Analysis:
Beyond Canonical Correlation Analysis 

Guoxu Zhou, Andrzej Cichocki, Fellow, IEEE, and Shengli Xie, Senior Member, IEEE

Abstract — Very often data we encounter in practice is a collection of matrices rather than a single matrix. These multi-block data are 
naturally linked and hence often share some common features and at the same time they have their own individual features, due to 
the background in which they are measured and collected. In this study we propose a new scheme of common and individual feature
analysis (CIFA) that processes multi-block data in a linked way, aiming at discovering and separating their common and individual
features. Depending on whether the number of common features is given, two efficient algorithms are proposed to extract the
common basis shared by all data. Feature extraction is then performed on the common and the individual spaces separately
by incorporating techniques such as dimensionality reduction and blind source separation. We also discuss how the proposed
CIFA can significantly improve the performance of classification and clustering tasks by exploiting common and individual features
of samples, respectively. Experimental results on synthetic and real data show encouraging performance of the proposed methods
in comparison with state-of-the-art methods.

Index Terms — Correlation Analysis, Linked Blind Source Separation, Common and Individual Feature Analysis, Classification, 
Clustering

1 Introduction and Motivation

MASSIVE high-dimensional data are increasingly 
prevalent in many areas of science. A variety of 
data analysis tools have been proposed for different 
purposes, such as data representation, interpretation, 
information retrieval, etc. Recently, multi-block data analysis
has attracted increasing attention [1], [2], [3], [4].
Multi-block data is encountered when multiple measurements
are taken from a set of experiments on the
same subject using various techniques or on multiple
subjects under similar configurations. For example, in
biomedical studies, human electrophysiological signals 
responding to some pre-designed stimuli will be col- 
lected from different individuals and trials. A number of 
different existing technologies and devices may be used 
to collect diverse information from different aspects. All 
these result in naturally linked multi-block data. These 
data should share some common information due to 
the background in which they are collected, and at the 
same time they also possess their individual features. It 
is consequently very meaningful to analyze the data in a 
connected and linked way instead of a separate one. This 
study is devoted to such an interesting and promising 
topic. 

Several methods have already been developed
for multi-block data analysis. For example, canonical
correlation analysis (CCA) was proposed to maximize
the correlations between the random variables in two 
data sets [5], [6], [7]. Later CCA was generalized to 
analyze multiple data sets and applied to joint blind 
source separation and feature extraction [1], [8], [9], [10]. 
In contrast to CCA, Partial Least Squares (PLS) maxi- 
mizes the covariance rather than correlations [11], [12], 
[13]. To analyze image populations, a framework named 
Population Value Decomposition (PVD) was proposed 
for data sets which have exactly the same size [3]. It turns
out that PVD can actually be studied in the more general 
framework of tensor (Tucker) decompositions, which is 
another hot topic for high-dimensional data analysis and 
exploration in recent years, see [14], [15] and references 
therein. A method named Joint and Individual Variation 
Explained (JIVE) was proposed for integrated analysis of 
multiple data types [2], together with a new algorithm 
which extracts their joint and individual components 
simultaneously. To the best of our knowledge, however, their
potential as a common and individual feature analysis 
tool has not been fully exploited. 

In this study a general framework of Common and
Individual Feature Analysis (CIFA) is proposed for
multi-block data analysis. Compared with existing
work, our main contributions include:

1) New efficient algorithms are proposed to extract a
common orthogonal basis from multi-block data,
depending on whether the number of common
components c is given or not.

2) A detailed analysis of the relationship between
the proposed methods and other related methods
such as CCA and principal component analysis
(PCA) is provided. Our results show that common
feature extraction can be interpreted as high-correlation
analysis, and that it performs PCA on the
common space shared by all data rather than on
the whole data as ordinary PCA does.

3) In the proposed framework various well- 
established data analysis methods proposed 
for a single matrix, e.g., dimensionality reduction 
[16], [17], Blind Source Separation (BSS) [18], 
Nonnegative Matrix Factorization (NMF) [15], can 
be easily applied to common and individual spaces 
separately in order to extract components with 
desired features and properties, which provides a 
quite flexible and versatile facility for multi-block 
data analysis tasks. 

4) Two important applications of CIFA, i.e., classification
and clustering, are discussed, illustrating
how the extracted common and individual
features can improve the performance of
data analysis.

The rest of the paper is organized as follows. In Sec- 
tion 2 the common orthogonal basis extraction (COBE) 
is discussed, including the problem statement, model, 
algorithms, and its relationship with CCA, PCA, and 
other related methods. In Section 3 the general frame- 
work of common and individual feature analysis (CIFA) 
is presented. The applications of CIFA in classification 
and clustering are discussed in Section 4. In Section 5 
simulations on synthetic and real data demonstrate the
efficiency and validity of the proposed methods. Finally,
we provide some concluding remarks and suggestions
for future work in Section 6.

2 Common Orthogonal Basis Extraction

2.1 Problem Formulation 



Given a set of matrices $\mathcal{Y} = \{Y_n \in \mathbb{R}^{I \times J_n} : n \in \mathcal{N}\}$, $\mathcal{N} = \{1, 2, \ldots, N\}$,
consider the following matrix factorization problem for each matrix $Y_n$:

$$\min \; \|Y_n - A_n B_n^T\|_F^2, \quad n \in \mathcal{N}, \tag{1}$$

where the columns of $A_n \in \mathbb{R}^{I \times R_n}$ consist of the latent
variables in $Y_n$ (sources, basis, etc.), and $B_n \in \mathbb{R}^{J_n \times R_n}$
denotes the corresponding coefficient matrix (mixing,
encoding, etc.). $R_n$ is the number of latent components
with $R_n < I$, which generally corresponds to a
compact/compressed representation of $Y_n$. The necessity
and justification of this assumption will be discussed in
Section 2.4.

So far a very wide variety of matrix factorization techniques have been proposed for (1), such as PCA,
independent component analysis (ICA) [19], [20], BSS [18], etc. In these methods the matrices $Y_n$ are
treated independently and separately. Here we consider the case where the data $Y_n$ are naturally
linked and share some common components such that

$$A_n = [\bar{A} \;\; \breve{A}_n], \quad n \in \mathcal{N}, \tag{2}$$

where $\bar{A} \in \mathbb{R}^{I \times c}$, $\breve{A}_n \in \mathbb{R}^{I \times (R_n - c)}$, and $c \le \min\{R_n : n \in \mathcal{N}\}$.
Throughout, a bar marks quantities belonging to the common part and a breve marks quantities belonging
to the individual part. In (2), $\bar{A}$ contains the common components shared by all matrices in $\mathcal{Y}$
while $\breve{A}_n$ contains the individual information present only in $Y_n$. In this way,
the matrices in $\mathcal{Y}$ are factorized in a linked way such that

$$Y_n \approx A_n B_n^T = [\bar{A} \;\; \breve{A}_n] \begin{bmatrix} \bar{B}_n^T \\ \breve{B}_n^T \end{bmatrix} = \bar{Y}_n + \breve{Y}_n, \quad n \in \mathcal{N}, \tag{3}$$

where $\bar{B}_n$ and $\breve{B}_n$ are the compatible partition of $B_n$.
In other words, each matrix $Y_n$ is represented by two
parts: the common space $\bar{Y}_n = \bar{A}\bar{B}_n^T$ and the individual
space $\breve{Y}_n = \breve{A}_n\breve{B}_n^T$, which are spanned by the common
components (i.e., the columns of $\bar{A}$) existing in all $Y_k$ ($k \in \mathcal{N}$)
and by the individual components $\breve{A}_n$ present only
in $Y_n$, respectively. Our problem is to seek $\bar{A}$ and $\breve{A}_n$
from a given set of matrices $Y_n$, $n \in \mathcal{N}$, without
knowledge of $B_n$ and possibly of the number $c$. Note that
two special cases of (3) have been extensively studied in
the past decades:

• $c = 0$. No common components exist in $Y_n$ and
the problem is simply equivalent to factorizing each
matrix separately.

• $c = R_n$ for all $n$. The problem is equivalent to ordinary
matrix factorization of a large matrix created
by stacking all matrices $Y_n$. This will be further
detailed at the end of this sub-section.

Note that the solution is not unique since $\bar{Y}_n = (\bar{A}Q)(Q^{-1}\bar{B}_n^T)$ is also a solution to (3) for an arbitrary
invertible matrix $Q$ of proper size. To shrink the solution space and simplify the computation we let
$\bar{A} = UR$ be the QR decomposition of $\bar{A}$ such that $U^T U = I$ (the matrix $I$ denotes the identity
matrix of proper size; where the size should be specified explicitly we
use $I_c$ to denote the $c$-by-$c$ identity matrix). Substituting
this into (3) we have

$$A_n B_n^T = [U \;\; \breve{A}_n] \begin{bmatrix} R\bar{B}_n^T \\ \breve{B}_n^T \end{bmatrix}. \tag{4}$$

Comparing (3) and (4), we can assume that $\bar{A}^T\bar{A} = I$ in
(3) hereafter, without loss of generality.

Taking our purpose into consideration, we further
assume that $\bar{A}^T\breve{A}_n = 0$, $n \in \mathcal{N}$, where $0$ is the zero
matrix of proper size. This assumption means that
there is no interaction between common features
and individual features, and it does not cause
any additional factorization error. To see this, for any $n$ with
$\bar{A}^T\breve{A}_n \ne 0$, from (3) and the fact that

$$\breve{A}_n = \bar{A}\bar{A}^T\breve{A}_n + (I - \bar{A}\bar{A}^T)\breve{A}_n \tag{5}$$

we have

$$\begin{aligned}
A_n B_n^T &= \bar{A}\bar{B}_n^T + \breve{A}_n\breve{B}_n^T \\
&= \bar{A}\bar{B}_n^T + [\bar{A}\bar{A}^T\breve{A}_n + (I - \bar{A}\bar{A}^T)\breve{A}_n]\breve{B}_n^T \\
&= \bar{A}[\bar{B}_n^T + \bar{A}^T\breve{A}_n\breve{B}_n^T] + [(I - \bar{A}\bar{A}^T)\breve{A}_n]\breve{B}_n^T \\
&= [\bar{A} \;\; (I - \bar{A}\bar{A}^T)\breve{A}_n] \begin{bmatrix} \bar{B}_n^T + \bar{A}^T\breve{A}_n\breve{B}_n^T \\ \breve{B}_n^T \end{bmatrix}.
\end{aligned} \tag{6}$$

Comparing (6) with (3) and redefining $\breve{A}_n \leftarrow (I - \bar{A}\bar{A}^T)\breve{A}_n$ and
$\bar{B}_n \leftarrow \bar{B}_n + \breve{B}_n\breve{A}_n^T\bar{A}$, we have $\bar{A}^T\breve{A}_n = 0$ immediately.
As a result, it is reasonable to assume that $\bar{A}^T\breve{A}_n = 0$.

Furthermore, we consider the truncated singular
value decomposition (SVD) $\breve{A}_n = U_n\Lambda_n V_n^T$, where
$U_n^T U_n = I$, $V_n^T V_n = I$, and $\Lambda_n \in \mathbb{R}^{(R_n - c) \times (R_n - c)}$ is
invertible. Then $\bar{A}^T\breve{A}_n = 0$ implies $\bar{A}^T U_n = 0$. Redefine
$\breve{A}_n \leftarrow U_n$ and $\breve{B}_n^T \leftarrow \Lambda_n V_n^T\breve{B}_n^T$. We then have $\bar{A}^T\breve{A}_n = 0$
and $\breve{A}_n^T\breve{A}_n = I$.
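The normalization argument above can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated factors; the function name `normalize_factors` and all sizes are illustrative and not part of the original text. It applies the QR step of (4), the projection of (5)-(6), and the SVD step in turn, so that one can confirm the product of the factors is unchanged while the bases become orthonormal and mutually orthogonal.

```python
import numpy as np

def normalize_factors(A_bar, A_ind, B_bar, B_ind):
    """Orthonormalize the common/individual bases without changing the product."""
    # Eq. (4): QR of the common basis; R is absorbed into the common coefficients.
    U, R = np.linalg.qr(A_bar)
    A_bar, B_bar = U, B_bar @ R.T
    # Eqs. (5)-(6): shift the part of A_ind lying in span(A_bar) into B_bar,
    # then keep only the component orthogonal to the common basis.
    B_bar = B_bar + B_ind @ (A_ind.T @ A_bar)
    A_ind = A_ind - A_bar @ (A_bar.T @ A_ind)
    # Thin SVD A_ind = U_n diag(s) V_n^T; diag(s) V_n^T is absorbed into B_ind.
    U_n, s, Vt = np.linalg.svd(A_ind, full_matrices=False)
    return A_bar, U_n, B_bar, (B_ind @ Vt.T) * s

rng = np.random.default_rng(0)
I, J, c, r = 30, 20, 2, 3   # illustrative sizes (assumption)
A_bar, A_ind = rng.standard_normal((I, c)), rng.standard_normal((I, r))
B_bar, B_ind = rng.standard_normal((J, c)), rng.standard_normal((J, r))
Y = A_bar @ B_bar.T + A_ind @ B_ind.T

A1, A2, B1, B2 = normalize_factors(A_bar, A_ind, B_bar, B_ind)
unchanged = np.allclose(Y, A1 @ B1.T + A2 @ B2.T)
```

The order of the two lines implementing (5)-(6) matters: the common coefficients must be updated with the original individual basis before that basis is projected.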

Based on the above analysis, the general problem we
consider can be formally formulated as:

$$\begin{aligned}
\min \;\; & \sum_{n \in \mathcal{N}} \|Y_n - \bar{A}\bar{B}_n^T - \breve{A}_n\breve{B}_n^T\|_F^2 \\
\text{s.t.} \;\; & \bar{A}^T\bar{A} = I_c, \;\; \breve{A}_n^T\breve{A}_n = I_{R_n - c}, \;\; \bar{A}^T\breve{A}_n = 0, \;\; n \in \mathcal{N}.
\end{aligned} \tag{7}$$

In (7) we have implicitly assumed that the number of
common components $c$ is known. How to estimate $c$
in practice and how to solve (7) will be discussed in
Sections 2.2 and 2.3. Compared with (1), the procedure
of (2)-(6) does not cause any additional decomposition
error. Hence the restriction $\mathrm{rank}(\bar{A}) + \mathrm{rank}(\breve{A}_n) = R_n$
implicitly guarantees that we are seeking common components.
Indeed, once $\bar{A}$ contains information other than
common components, the total decomposition error will
increase under this rank restriction.

It is worth noticing that once $R_n = c$ for all $n \in \mathcal{N}$, the
problem reduces to ordinary PCA, or equivalently
low-rank approximation of matrices. In this case $\bar{A}$ can
be found by solving

$$\min_{\bar{A}} \; \sum_{n \in \mathcal{N}} \|Y_n - \bar{A}\bar{B}_n^T\|_F^2 \quad \text{s.t.} \;\; \bar{A}^T\bar{A} = I_c. \tag{8}$$

Let $Y = [Y_1 \; Y_2 \; \cdots \; Y_N]$ be the $I \times (\sum_n J_n)$ matrix
obtained by stacking all matrices $Y_n$ horizontally, and similarly
let $\bar{B}^T = [\bar{B}_1^T \; \bar{B}_2^T \; \cdots \; \bar{B}_N^T]$. Then (8) can be viewed as
a partitioned version of PCA

$$\min_{\bar{A}} \; \|Y - \bar{A}\bar{B}^T\|_F^2 \quad \text{s.t.} \;\; \bar{A}^T\bar{A} = I_c. \tag{9}$$

If $Y$ is too large to fit into physical memory, we may
solve (8) instead of (9) in practice.
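One possible way to work block by block is sketched below. It is an illustration only, assuming NumPy and assuming that an $I \times I$ matrix fits in memory: since $YY^T = \sum_n Y_n Y_n^T$, the leading eigenvectors of the accumulated Gram matrix coincide with the leading left singular vectors of the stacked matrix, so the stacked matrix never has to be formed. The function name `pca_basis_blockwise` is hypothetical.

```python
import numpy as np

def pca_basis_blockwise(blocks, c):
    """Top-c left singular vectors of [Y_1 ... Y_N], computed one block at a time."""
    I = blocks[0].shape[0]
    gram = np.zeros((I, I))
    for Y_n in blocks:                 # only one block is needed in memory at a time
        gram += Y_n @ Y_n.T            # Y Y^T = sum_n Y_n Y_n^T
    _, eigvecs = np.linalg.eigh(gram)  # eigenvalues are returned in ascending order
    return eigvecs[:, ::-1][:, :c]

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((8, J)) for J in (5, 7, 6)]
A_block = pca_basis_blockwise(blocks, c=2)

# Reference: PCA basis from the explicitly stacked matrix, as in (9).
U, _, _ = np.linalg.svd(np.hstack(blocks), full_matrices=False)
A_stack = U[:, :2]
```

The two bases may differ by column signs, so they are best compared through the projectors `A @ A.T` rather than entry by entry.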

When $c < R_n$, model (7) is distinguished from (8)
by the individual parts $\breve{A}_n\breve{B}_n^T$ and the
rank restriction discussed above. In this sense, the solution of (7)
can also be interpreted as the principal components of
the common space $Y_n - \breve{A}_n\breve{B}_n^T$, i.e., the residuals after
removing the individual components. Unfortunately, as
the individual parts are also unknown and can have very
large variance (energy), we cannot obtain the common basis $\bar{A}$ by running
standard PCA on $Y_n$.

We use two steps to solve (7). In step 1, the matrices $Y_n$ in
(7) are replaced by their optimal rank-$R_n$ approximations
$A_n B_n^T$, obtained by solving (1) separately for each $Y_n$. To distinguish
the two, we call the original $Y_n$ the raw data and the
reduced version $Y_n \leftarrow A_n B_n^T$ the cleaned data. In step 2, (7)
is solved using the cleaned data. Because (2)-(6) show
that no additional error arises from the separation
of individual and common spaces, theoretically we have

$$Y_n = \bar{A}\bar{B}_n^T + \breve{A}_n\breve{B}_n^T. \tag{10}$$

In Sections 2.2 and 2.3 we focus on the second step.

2.2 The COBE Algorithm: the Number of Common 
Components c is Unknown 

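From (7) (or (10)), once $\bar{A}$ has been estimated, $\bar{B}_n$ can
be computed from (see footnote 1)

$$\bar{B}_n = (Y_n^T - \breve{B}_n\breve{A}_n^T)\bar{A}(\bar{A}^T\bar{A})^{-1} = Y_n^T\bar{A}. \tag{11}$$

After that, $\breve{A}_n$ can be computed via the truncated singular
value decomposition (tSVD) of the residual matrix $\breve{Y}_n = Y_n - \bar{A}\bar{B}_n^T$, $\forall n \in \mathcal{N}$.
In other words, estimating $\bar{A}$ plays
the central role in solving (7). In this section we focus on
the problem of how to estimate $\bar{A}$ efficiently.

For any $\bar{B}_n$ and $\breve{B}_n$, the optimal $\bar{A}$ and $\breve{A}_n$ in (7)
satisfy

$$[\bar{A} \;\; \breve{A}_n] = Y_n (B_n^T)^{\dagger}, \quad A_n^T A_n = I_{R_n}, \quad n \in \mathcal{N}, \tag{12}$$

where $A_n = [\bar{A} \;\; \breve{A}_n]$ (as in Eq. (2)), $B_n = [\bar{B}_n \;\; \breve{B}_n]$,
and $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudoinverse of
a matrix. Let $Y_n = Q_n R_n$ such that $Q_n^T Q_n = I$ (for
each matrix $Y_n$ this only needs to be computed once,
using, e.g., the QR decomposition or the truncated SVD of $Y_n$).
Then we define $Z_n = R_n (B_n^T)^{\dagger}$, and (12) is equivalent to

$$[\bar{A} \;\; \breve{A}_n] = Q_n Z_n, \quad n \in \mathcal{N}, \tag{13}$$

and hence for any $n_1, n_2 \in \mathcal{N}$, $n_1 \ne n_2$, there holds that

$$\begin{cases}
Q_{n_1} z_{n_1,k} = Q_{n_2} z_{n_2,k} = \bar{a}_k & \text{if } k \le c, \\
Q_{n_1} z_{n_1,k} \ne Q_{n_2} z_{n_2,k} & \text{if } k > c,
\end{cases} \tag{14}$$

where $z_{n,k}$ and $\bar{a}_k$ are the $k$th columns of $Z_n$ and $\bar{A}$,
respectively.

According to (14), the first column of $\bar{A}$, i.e., $\bar{a}_1$, can be
obtained by solving the following optimization model:

$$\min_{\bar{a}_1, z_{n,1}} \; f_1 = \sum_{n \in \mathcal{N}} \|Q_n z_{n,1} - \bar{a}_1\|_F^2 \quad \text{s.t.} \;\; \bar{a}_1^T\bar{a}_1 = 1. \tag{15}$$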

1. If $Y_n = \bar{A}\bar{B}_n^T + \breve{A}_n\breve{B}_n^T$ is exact as in (10), Equation (11) is
also exact. Otherwise (11) is interpreted as the least-squares solution
of $\min \|Y_n - \bar{A}\bar{B}_n^T - \breve{A}_n\breve{B}_n^T\|_F^2$. Similarly for equations (12) and (13).

We use alternating least-squares (ALS) iterations to solve
(15). Fix $z_{n,1}$ first; the optimal $\bar{a}_1$ is then given by

$$\bar{a}_1 = \frac{1}{N}\sum_{n \in \mathcal{N}} Q_n z_{n,1}, \tag{16}$$

and then $\bar{a}_1$ is normalized to have unit norm. Then $\bar{a}_1$
is fixed and we get

$$z_{n,1} = Q_n^T\bar{a}_1, \quad n \in \mathcal{N}. \tag{17}$$

Equations (16) and (17) are run alternately until convergence. If
$\min f_1 \le \epsilon$ for a very small threshold $\epsilon > 0$, a common
column $\bar{a}_1$ is found. Otherwise, no common basis exists
in $\mathcal{Y}$, and we terminate the procedure.

Now suppose that $\bar{a}_1, \ldots, \bar{a}_k$ have been found and we seek
the next common basis vector $\bar{a}_{k+1}$. To avoid finding repeated
common basis vectors, we consider a useful property of $Z_n$. Let
$Z_{n,c}$ denote the matrix consisting of the first $c$ columns
of $Z_n$. From (13) we have

$$Z_{n,c}^T Z_{n,c} = Z_{n,c}^T Q_n^T Q_n Z_{n,c} = \bar{A}^T\bar{A} = I, \tag{18}$$

which means that $z_{n,k+1}^T Z_{n,k} = 0$, i.e., $z_{n,k+1}$ is in the
null space of $Z_{n,k}^T$. Hence we update $Q_n$ as

$$Q_n^{(k+1)} = Q_n (I - Z_{n,k} Z_{n,k}^T), \tag{19}$$

where $Q_n^{(1)} = Q_n$. Then $\bar{a}_{k+1}$ can be found by solving

$$\min_{\bar{a}_{k+1}, z_{n,k+1}} \; f_{k+1} = \sum_{n \in \mathcal{N}} \|Q_n^{(k+1)} z_{n,k+1} - \bar{a}_{k+1}\|_F^2 \quad \text{s.t.} \;\; \bar{a}_{k+1}^T\bar{a}_{k+1} = 1. \tag{20}$$

Repeating the procedure used for (15) (see footnote 2), the minimum of
$f_{k+1}$ can be obtained. Again, there are two cases:

1) $\min f_{k+1} \le \epsilon$. In this case a new common basis
vector $\bar{a}_{k+1}$ is found. Update $Q_n^{(k+1)}$ using (19) and
then solve (20) to seek the next common basis vector.

2) Otherwise, no further common basis vector exists
and a total of $c = k$ common orthogonal basis
vectors are found as $\bar{A} = [\bar{a}_1 \; \bar{a}_2 \; \cdots \; \bar{a}_c]$.

In this way an orthogonal basis of the common space can
be found sequentially. This procedure is called common
orthogonal basis extraction (COBE) and is presented as
Algorithm 1.

The parameter $\epsilon$ controls how identical the extracted
components are. If $\epsilon = 0$, the extracted components
are exactly the same. Otherwise, approximately identical
(or equivalently highly correlated) components will be
extracted (see Section 2.5 for a detailed discussion). We can
adopt the SORTE method proposed in [21] to select the
parameter $\epsilon$ automatically. Basically, SORTE detects
the gap between the eigenvalues corresponding to the signal
space and those corresponding to the noise space. Here we can detect
the gap between the common space and the individual space
similarly. We will illustrate this in simulations.

2. For $k > 1$ the matrix $Q_n^{(k)}$ is not orthogonal any more. However,
it can be verified that $Q_n^{(k)T}$ is the Moore-Penrose pseudoinverse of
$Q_n^{(k)}$, thereby leading to the least-squares solution $z_{n,k} = Q_n^{(k)T}\bar{a}_k$.

Algorithm 1 The COBE Algorithm

Require: $Y_n$, $n \in \mathcal{N}$, $\epsilon > 0$.
Let $Y_n = Q_n R_n$ such that $Q_n^T Q_n = I$ for all $n \in \mathcal{N}$.
$\bar{A} = [\,]$, $Q_n^{(1)} = Q_n$, and $k = 1$.
while $f_k \le \epsilon$ do
    $Q_n^{(k)} = Q_n^{(k-1)}(I - z_{n,k-1} z_{n,k-1}^T)$ if $k > 1$, $n \in \mathcal{N}$.
    while not converged do
        $\bar{a}_k \leftarrow \sum_n Q_n^{(k)} z_{n,k} \,/\, \|\sum_n Q_n^{(k)} z_{n,k}\|_F$;
        $z_{n,k} = [Q_n^{(k)}]^T\bar{a}_k$, $n \in \mathcal{N}$;
    end while
    $\bar{A} = [\bar{A} \;\; \bar{a}_k]$;
    $k = k + 1$;
end while
return $\bar{A} = [\bar{a}_1 \; \bar{a}_2 \; \cdots \; \bar{a}_c]$, where $c = k - 1$.
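The sequential procedure of Algorithm 1 can be sketched in code as follows. This is a minimal illustration, assuming NumPy and noise-free synthetic blocks, and is not the authors' implementation. The names `column_basis` and `cobe`, the random initialization, the stopping tolerances, and the renormalization of $z$ in the deflation step are all assumptions made for this sketch. The test $f_k \le \epsilon$ is applied after each inner ALS loop, before the new vector is accepted.

```python
import numpy as np

def column_basis(Y, rtol=1e-10):
    """Orthonormal basis Q of the column space of Y (thin SVD, numerical rank)."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, s > rtol * s[0]]

def cobe(blocks, eps=1e-6, max_iter=1000, tol=1e-13, seed=0):
    """Sequentially extract orthonormal common basis vectors from a list of blocks."""
    rng = np.random.default_rng(seed)
    Qs = [column_basis(Y) for Y in blocks]             # Q_n^(1)
    max_c = min(Q.shape[1] for Q in Qs)                # c cannot exceed min_n R_n
    basis = []
    while len(basis) < max_c:
        a = rng.standard_normal(Qs[0].shape[0])
        a /= np.linalg.norm(a)
        for _ in range(max_iter):
            # Eq. (17) then Eq. (16): z_n = Q_n^T a, a <- normalized sum_n Q_n z_n.
            a_new = sum(Q @ (Q.T @ a) for Q in Qs)
            a_new /= np.linalg.norm(a_new)
            converged = np.linalg.norm(a_new - a) < tol
            a = a_new
            if converged:
                break
        zs = [Q.T @ a for Q in Qs]
        f = sum(np.linalg.norm(Q @ z - a) ** 2 for Q, z in zip(Qs, zs))
        if f > eps:                                    # no further common vector
            break
        basis.append(a)
        # Deflation, Eq. (19); z is renormalized so that I - z z^T is an exact projector.
        Qs = [Q - np.outer(Q @ z, z) / (z @ z) for Q, z in zip(Qs, zs)]
    return np.column_stack(basis) if basis else np.zeros((Qs[0].shape[0], 0))

# Synthetic check: three blocks sharing a 2-dimensional common subspace.
rng = np.random.default_rng(1)
I = 50
common = np.linalg.qr(rng.standard_normal((I, 2)))[0]
blocks = []
for J in (20, 25, 30):
    A_n = np.hstack([common, rng.standard_normal((I, 3))])   # [common, individual]
    blocks.append(A_n @ rng.standard_normal((5, J)))
A_hat = cobe(blocks)
```

The recovered basis is only determined up to an orthogonal rotation within the common subspace, so it should be compared with the ground truth through the projector `A_hat @ A_hat.T`.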



2.3 The COBE Algorithm With Specified Number of 
Common Components c 

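We briefly discuss the case where $c$ is given. Following
the analysis in Section 2.1, we solve the following model:

$$\min_{\bar{A}, Z_n} \; \sum_{n=1}^{N} \|Q_n Z_n - \bar{A}\|_F^2 \quad \text{s.t.} \;\; \bar{A}^T\bar{A} = I. \tag{21}$$

Again, we optimize with respect to $Z_n$ and $\bar{A}$ alternately.
When $\bar{A}$ is fixed, the optimal $Z_n$ is computed from

$$Z_n \leftarrow Q_n^T\bar{A}, \quad n \in \mathcal{N}. \tag{22}$$

When $Z_n$, $n \in \mathcal{N}$, are fixed, (21) is equivalent to

$$\max_{\bar{A}} \; \mathrm{trace}(P^T\bar{A}) \quad \text{s.t.} \;\; \bar{A}^T\bar{A} = I, \tag{23}$$

where $\mathrm{trace}(\cdot)$ denotes the trace of a matrix and

$$P = \sum_{n=1}^{N} Q_n Z_n. \tag{24}$$

Let $P = E\Lambda V^T$ be the truncated SVD of $P \in \mathbb{R}^{I \times c}$,
where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_c) \in \mathbb{R}^{c \times c}$ is a diagonal
matrix with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_c \ge 0$. Motivated by
the work in [22] (page 601; see footnote 3), we show that the optimal
solution of (23) is

$$\bar{A} = EV^T. \tag{25}$$

In fact,

$$\mathrm{trace}(P^T\bar{A}) = \mathrm{trace}(V\Lambda E^T\bar{A}) = \mathrm{trace}(\Lambda(E^T\bar{A}V)).$$

As $E^T E = I$ and $(\bar{A}V)^T(\bar{A}V) = I$, we have $[E^T\bar{A}V]_{ii} \le 1$,
which means that $\mathrm{trace}(\Lambda(E^T\bar{A}V)) \le \sum_{i=1}^{c}\lambda_i$.
Clearly, when $\bar{A} = EV^T$, there holds that $\bar{A}^T\bar{A} = I$,
$E^T\bar{A}V = I$, and $\mathrm{trace}(P^T\bar{A})$ reaches its upper bound $\sum_{i=1}^{c}\lambda_i$.

The pseudo-code is presented in Algorithm 2.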
3. The main difference is that here $\bar{A}$ is not necessarily square.

Algorithm 2 The COBEc Algorithm

Require: $c$ and $Y_n$, $n \in \mathcal{N}$.
Let $Y_n = Q_n R_n$ such that $Q_n^T Q_n = I$ for all $n \in \mathcal{N}$.
Initialize $Z_n$ randomly.
while not converged do
    $P = \sum_{n \in \mathcal{N}} Q_n Z_n$.
    $\bar{A} = EV^T$, where $[E, \Lambda, V] = \mathrm{tSVD}(P, c)$.
    $Z_n \leftarrow Q_n^T\bar{A}$.
end while
return $\bar{A}$.
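Algorithm 2 translates almost line by line into array code. The sketch below is illustrative only, assuming NumPy and noise-free synthetic blocks; the names `column_basis` and `cobec`, the objective-based stopping rule, and the tolerance are assumptions. The update of the common basis uses the closed form (25).

```python
import numpy as np

def column_basis(Y, rtol=1e-10):
    """Orthonormal basis Q of the column space of Y (thin SVD, numerical rank)."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, s > rtol * s[0]]

def cobec(blocks, c, max_iter=500, tol=1e-12, seed=0):
    """Extract c common orthonormal basis vectors by alternating (22) and (25)."""
    rng = np.random.default_rng(seed)
    Qs = [column_basis(Y) for Y in blocks]
    Zs = [rng.standard_normal((Q.shape[1], c)) for Q in Qs]   # random initialization
    prev = np.inf
    for _ in range(max_iter):
        P = sum(Q @ Z for Q, Z in zip(Qs, Zs))                 # Eq. (24)
        E, _, Vt = np.linalg.svd(P, full_matrices=False)
        A = E @ Vt                                             # Eq. (25)
        Zs = [Q.T @ A for Q in Qs]                             # Eq. (22)
        obj = sum(np.linalg.norm(Q @ Z - A) ** 2 for Q, Z in zip(Qs, Zs))
        if prev - obj < tol:                                   # objective of (21) has stalled
            break
        prev = obj
    return A

# Synthetic check: three blocks sharing a 2-dimensional common subspace.
rng = np.random.default_rng(1)
I = 50
common = np.linalg.qr(rng.standard_normal((I, 2)))[0]
blocks = []
for J in (20, 25, 30):
    A_n = np.hstack([common, rng.standard_normal((I, 3))])    # [common, individual]
    blocks.append(A_n @ rng.standard_normal((5, J)))
A_hat = cobec(blocks, c=2)
```

Because `E @ Vt` is the orthogonal polar factor of `P`, each pass keeps the estimate exactly orthonormal, which is why no separate normalization step appears in the loop.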



2.4 Pre-processing: Dimensionality Reduction 

Like CCA, COBE loses its meaning if $R_n = I$ for all $n$,
because in this case for any $I \times I$ invertible matrix $\bar{A}$ there
always exist matrices $\bar{B}_n$ such that $Y_n = \bar{A}\bar{B}_n^T$, i.e., any
$I \times I$ invertible matrix forms a common basis. Hence in
model (7) $R_n < I$ is required for all $Y_n$. Fortunately, this
requirement is not as restrictive as it may seem.
In practice, although observation
data can be of very high dimensionality, the latent rank
is often significantly lower than the dimensionality of the
observation data [23]. Even if this condition does
not hold, we perform dimensionality reduction, such as
PCA, on the raw data by solving (1) before running
COBE, as stated at the end of Section
2.1. In the dimensionality reduction step only principal
components are targeted, and the subsequent computational
complexity can be significantly reduced. Another strong
reason for running dimensionality reduction is to reduce
noise. Indeed, the significance of dimensionality reduction
has been extensively justified in the literature. In
this sense, if $A_n B_n^T$ in (1) is interpreted as the PCA of
each matrix $Y_n$, in COBE we simply rotate/transform
the columns of $A_n$ such that the common basis and the
individual basis are completely separated.

One of the most widely used dimensionality reduction
methods is principal component analysis (PCA),
which is based on the assumption that the noise is drawn
from independent identical Gaussian distributions.
If instead the noise is very sparse, we may consider robust
PCA (RPCA) [24]. Moreover, we may use SORTE
[21] or related techniques to estimate the number of
latent components $R_n$ and then use PCA to perform
dimensionality reduction.
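As a concrete illustration of this pre-processing step, the sketch below computes the cleaned data and the orthonormal factor $Q_n$ for one block by truncated SVD. It assumes NumPy, a known rank, and small Gaussian noise; the function name `clean_block` and the synthetic data are hypothetical.

```python
import numpy as np

def clean_block(Y, rank):
    """Best rank-`rank` approximation of Y (a solution of (1)) and a basis of its column space."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    Q = U[:, :rank]                                   # orthonormal factor, Q^T Q = I
    Y_clean = Q @ (s[:rank, None] * Vt[:rank])        # Q diag(s) V^T
    return Y_clean, Q

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 25))
Y_raw = low_rank + 0.01 * rng.standard_normal((40, 25))   # raw data = low rank + noise
Y_clean, Q = clean_block(Y_raw, rank=3)
```

The residual `Y_raw - Y_clean` has Frobenius norm equal to the root sum of squares of the discarded singular values, which gives a quick sanity check on the chosen rank.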

2.5 Relation With Other Methods 

COBE has a very close relation with canonical
correlation analysis (CCA). For two given sets of data
$Y_1$ and $Y_2$, CCA seeks vectors $w_1$ and $w_2$ such that
the correlation $\rho = \mathrm{corr}(Y_1 w_1, Y_2 w_2)$ is maximized. In
COBE, however, only the components with
correlation higher than a specified threshold are extracted.
Let $Y_n$ be row-centered (i.e., zero-mean) random
variables. We have

Proposition 1: Suppose that $\|Y_n w_n - \bar{a}\|_F \le \epsilon < 1$ with
$\|\bar{a}\|_F = 1$, $\forall n \in \mathcal{N}$. Then

$$\mathrm{corr}(Y_m w_m, Y_n w_n) \ge 1 - \frac{4\epsilon}{1+\epsilon}, \quad \forall m, n \in \mathcal{N}. \tag{26}$$

Proof: From $\|\bar{a}\|_F = 1$ and $\|Y_n w_n - \bar{a}\|_F \le \epsilon < 1$, we
have

$$0 < 1 - \epsilon \le \|Y_n w_n\|_F \le 1 + \epsilon, \quad \forall n \in \mathcal{N}. \tag{27}$$

Moreover, $\forall m, n \in \mathcal{N}$, there holds that

$$\|Y_m w_m - Y_n w_n\|_F \le \|Y_m w_m - \bar{a}\|_F + \|Y_n w_n - \bar{a}\|_F \le 2\epsilon. \tag{28}$$

Hence,

$$\begin{aligned}
2 w_m^T Y_m^T Y_n w_n &= \|Y_m w_m\|_F^2 + \|Y_n w_n\|_F^2 - \|Y_m w_m - Y_n w_n\|_F^2 \\
&\ge 2(1-\epsilon)^2 - 4\epsilon^2.
\end{aligned} \tag{29}$$

From (27) and (29), we have

$$\begin{aligned}
\mathrm{corr}(Y_m w_m, Y_n w_n) &= \frac{w_m^T Y_m^T Y_n w_n}{\|Y_m w_m\|_F \|Y_n w_n\|_F} \\
&\ge \frac{1 - 2\epsilon - \epsilon^2}{\|Y_m w_m\|_F \|Y_n w_n\|_F} \\
&\ge \frac{1 - 2\epsilon - \epsilon^2}{(1+\epsilon)^2} \\
&= 1 - \frac{4\epsilon + 2\epsilon^2}{(1+\epsilon)^2} \\
&\ge 1 - \frac{4\epsilon}{1+\epsilon}.
\end{aligned} \tag{30}$$

This ends the proof. □

From the proposition, once the objective values $f_k$ in (15) and (20) are upper
bounded, the correlations between the projected variables
$Y_n w_n$ are consequently lower bounded. In particular,
$\mathrm{corr}(Y_1 w_1, Y_2 w_2) \to 1$ as $\epsilon \to 0$. This shows that
COBE can actually be interpreted as high correlation
analysis (HCA), which differs from canonical correlation
analysis (CCA) for multiple data sets.
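The bound (26) can be probed numerically. The sketch below is an illustration only, assuming NumPy: it draws zero-mean vectors lying exactly at distance $\epsilon$ from a unit-norm vector, which play the role of the projected variables $Y_n w_n$, and records the smallest pairwise correlation observed. The sampling scheme, the value of $\epsilon$, and the helper names are assumptions of this sketch.

```python
import numpy as np

def correlation(u, v):
    """Correlation of two zero-mean vectors (normalized inner product)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
eps = 0.1
a = rng.standard_normal(200)
a -= a.mean()
a /= np.linalg.norm(a)                      # unit-norm, zero-mean reference vector

def perturbed():
    """A zero-mean vector y with ||y - a|| = eps, standing in for Y_n w_n."""
    d = rng.standard_normal(200)
    d -= d.mean()
    return a + eps * d / np.linalg.norm(d)

worst = min(correlation(perturbed(), perturbed()) for _ in range(1000))
bound = 1 - 4 * eps / (1 + eps)             # right-hand side of (26)
```

Random perturbations are nearly orthogonal to each other in high dimension, so the observed minimum typically sits well above the bound; the bound is attained only by adversarially aligned perturbations.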

Fig. 1 illustrates the relationship between
COBE and CCA. Given two matrices $A_n \in \mathbb{R}^{1000 \times 10}$, $n = 1, 2$,
let $a_{n,1}$, i.e., the first column of $A_n$,
be the sine wave $a_{1,1}(t) = \sin(0.01t)$ and $a_{2,1}(t) = \mathrm{sign}(a_{1,1}(t))$,
where $t = 1, 2, \cdots, 1000$. The entries of
the other components were drawn from independent
standard normal distributions. Each matrix was mixed
via a different matrix $B_n \in \mathbb{R}^{10 \times 10}$ whose entries were
drawn from independent standard normal distributions
such that $Y_n = A_n B_n^T$ ($n = 1, 2$). The red line in Fig. 1(a)
shows the common component $\bar{a}$ extracted by COBE,
and the corresponding components projected onto $\bar{a}$, i.e.,
$Y_n w_n$, $n = 1, 2$, match the
projected components obtained by CCA very well. In this sense,
COBE realizes CCA from another perspective. However, in
COBE only the components with very high correlations
are extracted, as stated in Proposition 1. From the
figure, $\bar{a}$ can be interpreted as the principal component
of $Y_n w_n$, information that is not provided by
CCA. Note also that in the proposed method, the common
components (highly correlated features) satisfy
$\bar{A}^T\bar{A} = I$, which makes COBE similar to a regularized CCA
[9], [25], [26]. Finally, due to its close relationship with
CCA, COBE inherits most of CCA's connections with and differences
from other related methods such as PLS [27],
alternating conditional expectation (ACE) [28], etc.



[Figure 1: two panels, (a) COBE and (b) CCA, with legend entries $Y_1 w_1$ and $Y_2 w_2$.]

Fig. 1: Illustration of the relation between COBE and 
CCA. Generally, COBE focuses on highly correlated com- 
ponents and returns the principal components of them. 

From (7) and (8), we know that COBE also has a close relation
with PCA. Fig. 2 illustrates the difference between
COBE and PCA using the above data (the principal
component is computed from the concatenated matrix
$Y$ defined in (9)). Basically, COBE seeks the principal
components $\bar{A}$ of the common space (spanned by common
or very similar basis vectors) of all data, whereas PCA seeks
the principal components of all data. This makes COBE
quite useful for finding highly relevant and related information
in a large number of sets of signals. Moreover,
as (7) can be interpreted as the PCA of the common
space of all data sets, or the PCA of the individual
space of each single data set, we may optimize (7) by
a series of alternating truncated SVDs (PCA), which
is the approach adopted by the JIVE method [2]. This approach
involves frequent SVDs of huge matrices formed by all
data in the computation of the common space and hence
is quite time consuming. Compared with JIVE, the
COBEc method is more efficient in optimization and
more intuitive and flexible in the estimation of the number
of common components.

2.6 Scalability For Large-Scale Problems 

In multi-block data analysis it is common that the data
we encounter is huge. Here huge means that both $I$ and
$J_n$ ($n \in \mathcal{N}$) are quite large. Note that in (15), (20), and (21)
we actually use the dimensionality-reduced matrices $Q_n \in \mathbb{R}^{I \times R_n}$
with $R_n < I$. Hence the value of $J_n$ is generally
not a big issue. In the case where $I$ is extremely large,
we consider the following way to significantly reduce
the time and memory consumption of COBE. Let $P \in \mathbb{R}^{I_p \times I}$
be a random matrix with $\max_{n \in \mathcal{N}}(R_n) < I_p \ll I$.
From (12) we may solve the following model first:

$$\min \; \sum_{n \in \mathcal{N}} \|Y_n^p W_n - \bar{A}^p\|_F^2, \tag{31}$$

where $Y_n^p = PY_n \in \mathbb{R}^{I_p \times J_n}$ is much smaller than
$Y_n$, and $\bar{A}^p = P\bar{A}$. After $W_n$ have been estimated
by using COBE or COBEc, the corresponding common
features can be computed from $\bar{A} = Y_n W_n$. Obviously,
$PY_n W_n = P\bar{A}$ as long as $Y_n W_n = \bar{A}$. In other words,
this approach will not lose any common features. In the worst
case, however, (31) may give fake common features $\bar{a}_k$
when $Y_n w_{n,k} - \bar{a}_k$ happens to lie in the null space
of $P$. Fortunately this rarely happens in practice, and
these fake common features can be easily detected by
examining the value of $\|Y_n w_{n,k} - \bar{a}_k\|_F$.

Fig. 2: Illustration of the relation between COBE and
PCA. The COBE method finds the principal components
of the highly correlated columns whereas PCA finds the
principal components of all columns.
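The random-projection idea can be sketched as follows. This is a rough illustration under several assumptions, not the authors' implementation: it uses NumPy, a Gaussian projection matrix, noise-free synthetic blocks, a fixed number of COBEc-style iterations on the projected data, and recovery of the common features from the first block only. The names `thin_svd` and `common_basis_via_projection` are hypothetical. Since $P\bar{A}$ is generally not orthonormal even when $\bar{A}$ is, the recovered features are re-orthonormalized at the end.

```python
import numpy as np

def thin_svd(Y, rtol=1e-10):
    """Thin SVD truncated at the numerical rank of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    r = int(np.sum(s > rtol * s[0]))
    return U[:, :r], s[:r], Vt[:r]

def common_basis_via_projection(blocks, c, I_p, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    I = blocks[0].shape[0]
    P = rng.standard_normal((I_p, I))                 # random projection with I_p << I
    svds = [thin_svd(P @ Y) for Y in blocks]          # factor the small matrices Y_n^p = P Y_n
    Qs = [U for U, _, _ in svds]
    A_p = np.linalg.qr(rng.standard_normal((I_p, c)))[0]
    for _ in range(n_iter):                           # COBEc-style updates on the small problem
        M = sum(Q @ (Q.T @ A_p) for Q in Qs)
        E, _, Vt = np.linalg.svd(M, full_matrices=False)
        A_p = E @ Vt
    U1, s1, V1t = svds[0]
    W1 = V1t.T @ ((U1.T @ A_p) / s1[:, None])         # least-squares W_1 with Y_1^p W_1 = A^p
    return np.linalg.qr(blocks[0] @ W1)[0]            # A = Y_1 W_1, re-orthonormalized

# Synthetic check: I = 400 rows reduced to I_p = 20.
rng = np.random.default_rng(2)
I = 400
common = np.linalg.qr(rng.standard_normal((I, 2)))[0]
blocks = []
for J in (30, 30, 30):
    A_n = np.hstack([common, rng.standard_normal((I, 3))])
    blocks.append(A_n @ rng.standard_normal((5, J)))
A_rec = common_basis_via_projection(blocks, c=2, I_p=20)
```

All iterations operate on matrices with `I_p` rows; the full-size blocks are touched only once for the projection and once for the final recovery.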

3 Common And Individual Feature Analysis

3.1 Linked BSS with Pre-whitening 

So far we have only imposed orthogonality on the components
$\bar{A}$. In this case the common components are not unique,
as the columns of $\bar{A}U$ also form a common orthogonal
basis for any orthogonal matrix $U$. Sometimes we want
to project the common components onto a feature space
with some desired property or uniqueness. This can
typically be done by, for example, blind source separation
(BSS) [18]. BSS is the problem of finding latent variables $S$
from their linear mixtures $Y = SM^T$ such that

$$\Psi(Y) = SPD, \tag{32}$$

where $\Psi$ denotes a BSS algorithm and $M$ is the mixing
matrix. $P$ and $D$ are a permutation matrix and a diagonal
matrix, respectively, denoting the unavoidable ambiguities
of BSS. In other words, by using BSS methods the sources
can be recovered from their mixtures up to a scale and
permutation ambiguity, without any
knowledge of the mixing matrix $M$. Hence BSS is quite
attractive and has served as a feature extraction
tool in a wide range of applications, such as pattern
recognition, classification, etc. Suppose that the
latent features (sources) $F$ satisfy

$$\bar{Y}_n = FM_n^T, \tag{33}$$

where $\bar{Y}_n$ is defined in (3). From $\bar{Y}_n = \bar{A}\bar{B}_n^T$ we have

$$\bar{A} = F(\bar{B}_n^{\dagger}M_n)^T. \tag{34}$$

Consequently, the columns of $\bar{A}$ are just linear mixtures
of $F$ and hence $F$ can be estimated via

$$F = \Psi(\bar{A}), \tag{35}$$

by using a proper BSS algorithm $\Psi$. In this case $\bar{A}$ is
actually the pre-whitened version of (33), from (34) and
the fact that $\bar{A}^T\bar{A} = I$. By using BSS, we may obtain
common features with desired properties such as
sparsity, independence, temporal correlations, nonnegativity,
etc., by imposing proper penalties on $F$, or even
nonlinear common features by using kernel tricks [20].
We call the above BSS procedure linked BSS because
we perform BSS on multi-block linked data $Y_n$. Note
that the JBSS method in [1] also performs BSS involving
multi-block data. It extracts a group of signals with
the highest correlations each time and it requires that
the extracted groups have distinct correlations. In other
words, the JBSS method is actually a way to realize BSS
by applying multiple-set CCA. In contrast, the linked
BSS method extracts the common basis first and then applies
ordinary BSS to it to discover common components with
some desired property and diversity.

3.2 Common Nonnegative Features Extraction 

In the case where $F$ is required to be nonnegative,
we cannot run NMF methods on $\bar{A}$ directly. In this
case we use two steps to extract nonnegative common
components. First, from (11), the common space, i.e.,
$\bar{Y}_n = \bar{A}\bar{A}^T Y_n$, can be extracted. Then we consider the
following low-rank approximation based (semi-)nonnegative
matrix factorization (NMF) model [29]:

$$\min \; \sum_{n \in \mathcal{N}} \|FM_n^T - \bar{A}\bar{B}_n^T\|_F^2 \quad \text{s.t.} \;\; F \ge 0. \tag{36}$$

By using low-rank NMF (if $M_n$ is also nonnegative)
or low-rank semi-NMF (where $M_n$ is arbitrary) we can
extract the common nonnegative components $F$. For
example, under the following multiplicative update
rules, applied iteratively, both $F$ and $M_n$ remain nonnegative:

$$\begin{aligned}
F &\leftarrow F \circledast [\bar{A}\textstyle\sum_n(\bar{B}_n^T M_n)]_+ \oslash [F(\textstyle\sum_n M_n^T M_n)], \\
M_n &\leftarrow M_n \circledast [\bar{B}_n(\bar{A}^T F)]_+ \oslash [M_n(F^T F)],
\end{aligned} \tag{37}$$

where $\circledast$ and $\oslash$ are the element-wise product and division
of matrices. See [29] for a detailed convergence analysis.
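The multiplicative updates (37) can be sketched as follows. This is a minimal illustration assuming NumPy, positive random initialization, and the reading of $[\cdot]_+$ as clipping at zero, which is an assumption of this sketch rather than a statement from the text; the function name `nonneg_common_features` and the small constant added to the denominators are likewise illustrative. The large matrices $\bar{A}\bar{B}_n^T$ are never formed inside the updates; only their low-rank factors are used.

```python
import numpy as np

def nonneg_common_features(A_bar, B_bars, n_iter=300, tiny=1e-12, seed=0):
    """Multiplicative updates (37) for min sum_n ||F M_n^T - A_bar B_bar_n^T||_F^2, F >= 0."""
    rng = np.random.default_rng(seed)
    c = A_bar.shape[1]
    F = rng.random((A_bar.shape[0], c)) + 0.1          # positive initialization
    Ms = [rng.random((B.shape[0], c)) + 0.1 for B in B_bars]
    pos = lambda X: np.maximum(X, 0.0)                 # [.]_+ read as clipping at zero (assumption)
    errors = []
    for _ in range(n_iter):
        num = pos(A_bar @ sum(B.T @ M for B, M in zip(B_bars, Ms)))
        F = F * num / (F @ sum(M.T @ M for M in Ms) + tiny)
        AtF, FtF = A_bar.T @ F, F.T @ F
        Ms = [M * pos(B @ AtF) / (M @ FtF + tiny) for B, M in zip(B_bars, Ms)]
        errors.append(sum(np.linalg.norm(F @ M.T - A_bar @ B.T) ** 2
                          for B, M in zip(B_bars, Ms)))
    return F, Ms, errors

# Synthetic check: common parts generated from nonnegative features.
rng = np.random.default_rng(0)
F_true = rng.random((40, 2))
A_bar = np.linalg.qr(F_true)[0]                        # orthonormal basis of span(F_true)
B_bars = [(F_true @ rng.random((2, J))).T @ A_bar for J in (15, 20)]
F, Ms, errors = nonneg_common_features(A_bar, B_bars)
```

In this synthetic setting the common parts are entrywise nonnegative, so the clipping is inactive and the rules reduce to standard NMF multiplicative updates, for which the fitting error is non-increasing.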



3.3 Individual Feature Extraction (IFE) 

In the above sections we discussed common feature extraction
(CFE). Besides the common features $F$ or $\bar{A}$, each
data set also has its own individual features contained in the
matrix $\breve{Y}_n = Y_n - \bar{Y}_n$. These individual features are
often quite helpful in classification and recognition tasks.
Although $\breve{Y}_n$ has the same size as $Y_n$, it is rank deficient
and satisfies $\mathrm{rank}(\breve{Y}_n) + \mathrm{rank}(\bar{Y}_n) = R_n$. Hence
dimensionality reduction on $\breve{Y}_n$ should be a top priority
before further analysis. We can run any dimensionality
reduction method discussed in Section 2.4 on each $\breve{Y}_n$
separately to estimate $\breve{A}_n$ and $\breve{B}_n$, and then use BSS or
related methods to extract the features in $\breve{A}_n$ and $\breve{B}_n$.
However, there is a major difference between the dimensionality
reduction methods considered here and those
in the pre-processing stage. In the pre-processing stage
dimensionality reduction is rather general-purpose and
relatively simple, whereas in this stage the dimensionality
reduction is more closely related to the specific
purpose of the task at hand. For example, if we want to
visualize the data in a low-dimensional space, we may
consider the methods discussed in [17]. For classification
and recognition tasks, we may need to extract as much discriminative
information, neighbor relationships, etc., as possible [30].
See also [6] for a unified least-squares
framework for various component analysis methods. In summary,
careful selection of the dimensionality reduction method in
this stage is critical to achieving the ultimate
purpose. The above procedure is called individual
feature extraction (IFE), as the extracted features are
present only in each individual data set.
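The split into common and individual parts that precedes IFE can be sketched as below. The sketch assumes NumPy, an already estimated orthonormal common basis, a known number of individual components, and plain truncated SVD as the dimensionality reduction method; a task-specific method would replace that step in practice. The function name `split_common_individual` is hypothetical.

```python
import numpy as np

def split_common_individual(Y, A_bar, n_individual):
    """Common part, individual part, and a truncated-SVD basis of the individual part."""
    Y_common = A_bar @ (A_bar.T @ Y)                  # projection onto span(A_bar)
    Y_ind = Y - Y_common                              # individual part
    U, s, Vt = np.linalg.svd(Y_ind, full_matrices=False)
    A_ind = U[:, :n_individual]                       # orthonormal individual basis
    B_ind = Vt[:n_individual].T * s[:n_individual]    # so that Y_ind ~ A_ind B_ind^T
    return Y_common, Y_ind, A_ind, B_ind

rng = np.random.default_rng(0)
I, J = 40, 25
A_bar = np.linalg.qr(rng.standard_normal((I, 2)))[0]
Y = A_bar @ rng.standard_normal((2, J)) \
    + rng.standard_normal((I, 3)) @ rng.standard_normal((3, J))
Y_common, Y_ind, A_ind, B_ind = split_common_individual(Y, A_bar, n_individual=3)
rank_sum = np.linalg.matrix_rank(Y_common) + np.linalg.matrix_rank(Y_ind)
```

Here `rank_sum` equals the rank of `Y`, matching the rank relation stated above, and the individual basis is orthogonal to the common one by construction.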

Finally, we give the flow diagram of the proposed 
common and individual feature analysis (CIFA) in Fig. 3. 

4 Two Applications 

4.1 Classification Using Common Features 

In classification and pattern recognition tasks, we have a set of training
data consisting of training samples and their labels. It is natural that
objects belonging to the same category share some common features. Let
$\bar{\mathbf{F}}_k$ denote the common features extracted from the $k$th
category, $k \in \mathcal{K} = \{1, 2, \ldots, K\}$. Then for a new test
sample $\mathbf{y}_t$, we compute its matching score $r_t(k)$ with each
$\bar{\mathbf{F}}_k$:

$$r_t(k) = \mathrm{Matching}(\mathbf{y}_t, \bar{\mathbf{F}}_k), \quad k \in \mathcal{K}. \qquad (38)$$

As the samples in the same class should share some of the same features, the
label of $\mathbf{y}_t$ is estimated as

$$l_t = \arg\max_{k \in \mathcal{K}} r_t(k). \qquad (39)$$

There are many choices for defining $r_t(k)$, such as the Euclidean distance
or the correlation (angle) between $\mathbf{y}_t$ and the space spanned by
$\bar{\mathbf{F}}_k$, which can be computed via least squares and CCA,
respectively.
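
As an illustration of (38) and (39), the sketch below uses the cosine of the angle between a test sample and the span of each class basis as the matching score, computed by least squares. This is one possible choice under stated assumptions (no centering, full-column-rank bases); the function names `correlation_score` and `classify` are hypothetical and the class bases are random placeholders rather than features extracted by CFE.

```python
import numpy as np

def correlation_score(y, F):
    """Cosine of the angle between y and span(F); 1 means y lies in the span."""
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    y_hat = F @ coef                      # least-squares projection of y onto span(F)
    return np.linalg.norm(y_hat) / np.linalg.norm(y)

def classify(y, bases):
    """Return the index k with the largest matching score, and all scores."""
    scores = np.array([correlation_score(y, F) for F in bases])
    return int(np.argmax(scores)), scores

# Toy usage: three random 3-dimensional class subspaces in a 50-dimensional space.
rng = np.random.default_rng(0)
bases = [rng.standard_normal((50, 3)) for _ in range(3)]
y = bases[1] @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(50)
label, scores = classify(y, bases)
```

If a distance is used instead of a correlation, the residual norm `np.linalg.norm(y - y_hat)` plays the role of the score and the smallest value is selected.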

Note that for the linear discriminant analysis (LDA) classifier, the number
of features should be significantly less than the number of samples to ensure
the positive definiteness of the covariance matrix. The proposed method has
no such limitation.

Fig. 3: Flow diagram of the general common and individual feature analysis (CIFA).

4.2 Clustering Using Individual Features 

Clustering is the task of assigning a set of objects to clusters such that
the objects in the same cluster are as similar as possible. Cluster analysis
is widely applied in data mining, machine learning, information retrieval,
and bioinformatics. Unlike classification, clustering is a typical
unsupervised learning approach, that is, no training data are available.
In cluster analysis, we need to compare the similarity between samples. In
many practical applications, all the samples may have some common features
even though they are in different clusters and certainly have some
dissimilarity. For example, in human face image analysis, every face has
common facial organs such as cheeks, nose, eyes, and mouth, and the faces
often share, to some extent, the same features reflecting the shapes and
locations of these organs. The common features present in all samples are
useless for clustering, as they do not provide any discriminative
information. It is therefore reasonable to remove these common/similar
features first and then use the individual features to cluster the objects.
Intuitively, this should significantly improve the clustering accuracy when
all objects have common features.

In Fig. 5 we show how COBE incorporating CNFE is able to extract common faces
(features) on the PIE database (details of the PIE database are given in the
next section). Here we manually set $c = 2$ and used CNFE to extract the
common nonnegative components. From the common faces shown in Fig. 5(a), we
can observe a basic profile of human faces. In Fig. 5(b), the individual
local features are accentuated. These individual features should be quite
helpful for improving the accuracy of clustering and recognition tasks.
Generally, in our individual-feature-based clustering method we follow the
steps below:

1) Randomly split the samples $\mathbf{y}_t$ into $N$ groups to construct
$\mathbf{Y}_n$, where $t \in \mathcal{T} = \{1, 2, \ldots, T\}$ and
$n \in \mathcal{N}$.

2) Run COBE to extract the common features $\bar{\mathbf{A}}$ of
$\{\mathbf{Y}_n, n \in \mathcal{N}\}$.

3) Remove their common features from $\mathbf{Y}_n$ by letting
$\breve{\mathbf{Y}}_n = \mathbf{Y}_n - \bar{\mathbf{A}} \bar{\mathbf{A}}^T \mathbf{Y}_n$.

4) Perform dimensionality reduction and feature extraction on
$\breve{\mathbf{Y}}_n$ to obtain features $\breve{\mathbf{F}}_n$.

5) Run clustering algorithms on $\{\breve{\mathbf{f}}_t, t \in \mathcal{T}\}$,
where $\breve{\mathbf{f}}_t$ are the columns of $\breve{\mathbf{F}}_n$
corresponding to the original objects $\mathbf{y}_t$.

Fig. 4: Flow diagram of classification by using common features extracted
from each class. (Diagram labels: Training data; Class 1, Class 2, ...,
Class K; CFE; Matching score.)

Fig. 5: Illustration of how COBE incorporating CNFE is able to extract the
common features on the PIE database. (a) Common faces $\bar{\mathbf{F}}$.
(b) The first 64 samples of individual faces $\breve{\mathbf{f}}_t$ obtained
by removing the common components. Their local individual features are
accentuated.

See Fig. 3 for more details. Note that the dimensionality reduction and
feature extraction methods used here should be chosen so that they
substantially benefit the clustering task.
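
The bookkeeping in steps 1, 3, and 4 can be sketched as follows. This is only an outline under assumptions: the common basis of step 2 is replaced by a random orthonormal placeholder (COBE itself is not reproduced here), PCA via an SVD is used as one possible feature extractor in step 4, and the helper names are hypothetical. The clustering algorithm of step 5 is left to the reader's choice.

```python
import numpy as np

def split_samples(Y, n_groups, rng):
    """Step 1: randomly split the column indices of Y into n_groups index sets."""
    return np.array_split(rng.permutation(Y.shape[1]), n_groups)

def remove_common(Y, groups, A_bar):
    """Step 3: subtract A_bar A_bar^T Y_n from every group Y_n, writing the
    result back in the original column order (column t still matches y_t)."""
    Y_ind = np.empty_like(Y, dtype=float)
    for g in groups:
        Yn = Y[:, g]
        Y_ind[:, g] = Yn - A_bar @ (A_bar.T @ Yn)
    return Y_ind

def pca_features(Y_ind, dim):
    """Step 4 (one possible choice): PCA scores of the individual parts."""
    Yc = Y_ind - Y_ind.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    return s[:dim, None] * Vt[:dim]       # dim x T, column t holds the features of y_t

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 12))                         # 12 samples as columns
A_bar, _ = np.linalg.qr(rng.standard_normal((30, 2)))     # placeholder for step 2
groups = split_samples(Y, 3, rng)
Y_ind = remove_common(Y, groups, A_bar)
F = pca_features(Y_ind, dim=2)
```

Because the same basis is removed from every group, step 3 is equivalent to applying the projection once to all columns; the per-group loop is kept only to mirror the steps as listed.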

5 Simulations and Experiments 

Linked BSS. In this simulation we generated a total of ten matrices
$\mathbf{A}_n \in \mathbb{R}^{5000 \times 10}$, $n = 1, 2, \ldots, 10$, whose
first four columns were the speech signals included in the ICALAB benchmark
(named Speech4.mat) [31], and whose other six columns were drawn from
independent standard normal distributions. The entries of the mixing matrices
$\mathbf{B}_n \in \mathbb{R}^{50 \times 10}$ were also drawn from independent
standard normal distributions. Finally, we let
$\mathbf{Y}_n = \mathbf{A}_n \mathbf{B}_n^T + \mathbf{E}_n$, where
$\mathbf{E}_n$ models white Gaussian noise (SNR = 20 dB). We first used the
COBE, JIVE [2], JBSS [1], and PCA methods to extract the common components.
Then we ran the SOBI method [32] to extract the latent speech signals (as
JBSS did not perform well in this simulation, we also used SOBI to improve
its results). TABLE 1 shows the simulation results averaged over 50
Monte-Carlo runs, where SIR$_i$, i.e., the signal-to-interference ratio (SIR)
of the $i$th estimated signal, is defined as follows to evaluate the
separation accuracy:



$$\mathrm{SIR}(\mathbf{s}, \hat{\mathbf{s}}) = 10 \log_{10} \frac{\sum_t s_t^2}{\sum_t (s_t - \hat{s}_t)^2}, \qquad (40)$$



where $\mathbf{s}$ and $\hat{\mathbf{s}}$ are normalized random variables
with zero mean and unit variance, and $\hat{\mathbf{s}}$ is an estimate of
$\mathbf{s}$. It can be seen that JIVE and COBE achieved higher SIRs than
JBSS and PCA, although the performance of JBSS was improved by incorporating
the SOBI method compared with its original version. Moreover, although PCA is
closely related to COBE, the table shows again that the common features
extracted by PCA are often contaminated by individual features. COBE and JIVE
achieved almost the same separation accuracy, but COBE was much faster. In
particular, the performance of JIVE is quite sensitive to the estimate of the
ranks of the joint/common and individual components. If the ranks are given
accurately, JIVE performs well. Otherwise its efficiency is significantly
reduced. For example, in this instance, when the ranks of the individual
components were specified as 7 (denoted as JIVE* in TABLE 1) while they were
actually 6, JIVE took more than 77 seconds to converge. In [2] a method to
estimate the number of components was proposed; however, it is quite time
consuming and its performance depends on skillful selection of its parameters
(for this instance, JIVE took more than two hours to estimate the rank). For
the COBE method, first of all, the total time consumption depends on the
number of common components and the size of the problem. This makes COBE much
more efficient than JIVE. Moreover, the estimation of the number of common
components is simpler and more intuitive. Generally, we can estimate the
number of components by tracking the value of $f_i$, as illustrated in
Fig. 6. As there is often a big gap between the values of $f_i$ corresponding
to the common components and the others, we can use SORTE [21] to detect the
number of components. Note also that the threshold $\epsilon$ bounds the
correlations between the common components (see Proposition 1), or how
identical they are. This provides another intuitive way to select the
parameter.

Fig. 6: Illustration of how to detect the number of common components by
locating the gap between the values of $f_i$ under different noise levels.
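
For reference, the SIR in (40) can be computed as in the short sketch below. The standardization follows the definition (zero mean, unit variance); the sign correction is an added assumption, reflecting the usual sign ambiguity of BSS estimates, and the function name `sir_db` is hypothetical.

```python
import numpy as np

def sir_db(s, s_hat):
    """Signal-to-interference ratio (dB) between a source and its estimate."""
    s = (s - s.mean()) / s.std()
    s_hat = (s_hat - s_hat.mean()) / s_hat.std()
    if np.dot(s, s_hat) < 0:              # assumption: resolve the sign ambiguity
        s_hat = -s_hat
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

# Toy usage: an estimate corrupted by noise whose power is 1% of the signal power.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 5000)
s = np.sin(2 * np.pi * 5 * t)
s_hat = s / s.std() + 0.1 * rng.standard_normal(t.size)
value = sir_db(s, s_hat)                  # close to 20 dB for this noise level
```

Matching each estimated signal to its true source (the permutation ambiguity) is not handled here and would have to be resolved before calling the function.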

TABLE 1: Performance comparison in linked BSS. The latent signals were
estimated by applying the SOBI method to the common components extracted by
each algorithm.

Algorithm   SIR_1 (dB)   SIR_2 (dB)   SIR_3 (dB)   SIR_4 (dB)   Runtime (s)
COBE        21.1         23.5         23.9         24.6         0.5
COBEc       21.1         23.3         24.2         24.7         0.6
JIVE        21.2         23.8         24.2         25.0         7.4
JIVE*       21.2         23.8         24.1         24.9         77.5
JBSS        15.1         15.4         15.9         16.3         1.7
PCA         15.8         17.1         17.8         19.4         0.5
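
The structure of this simulation can be reproduced in miniature. The sketch below is not COBE: as a simple stand-in it takes the leading left singular vectors of the stacked orthonormal bases $[\mathbf{Q}_1, \ldots, \mathbf{Q}_N]$, using the fact that a direction shared by all $N$ subspaces yields a singular value near $\sqrt{N}$. Sinusoids replace the speech signals, the row dimension is reduced to 1000, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, R, c, N = 1000, 50, 10, 4, 10       # smaller I than the 5000 used in the text

# Stand-in common sources (sinusoids), standardized to zero mean, unit variance.
t = np.arange(I)
S = np.stack([np.sin(2 * np.pi * f * t / I) for f in (3, 7, 11, 17)], axis=1)
S = (S - S.mean(axis=0)) / S.std(axis=0)

blocks = []
for _ in range(N):
    A_n = np.hstack([S, rng.standard_normal((I, R - c))])   # first c columns shared
    B_n = rng.standard_normal((J, R))
    X = A_n @ B_n.T
    E = rng.standard_normal(X.shape)
    E *= np.linalg.norm(X) / np.linalg.norm(E) / 10.0       # noise scaled to 20 dB SNR
    blocks.append(X + E)

# Orthonormal basis of the dominant R-dimensional column space of each block.
Qs = [np.linalg.svd(Y, full_matrices=False)[0][:, :R] for Y in blocks]

# Stand-in for common basis extraction: shared directions have singular values
# near sqrt(N), while individual directions give much smaller ones.
U, sv, _ = np.linalg.svd(np.hstack(Qs), full_matrices=False)
A_bar = U[:, :c]

# Cosines of the principal angles between span(A_bar) and the true common space.
Q_true, _ = np.linalg.qr(S)
cosines = np.linalg.svd(Q_true.T @ A_bar, compute_uv=False)
```

The drop in `sv` after the first `c` values plays a role loosely analogous to the gap in Fig. 6; the recovered basis would still need a BSS step such as SOBI to separate the individual sources.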



In Fig. 7 we show the performance of COBE, in terms of running time and
separation accuracy, when the observations were projected onto a lower
$I_p$-dimensional space by multiplying them by an $I_p \times I$ random
matrix $\mathbf{P}$. The results were averaged over 50 independent runs. In
each run the entries of the projection matrix $\mathbf{P}$ were drawn from
independent standard normal distributions. From the figure, as $I_p$
increases, the running time increases approximately linearly whereas the
improvement in accuracy levels off, which justifies the analysis in
Section 2.6. Based on this fact, we may use projection to significantly
improve the efficiency of COBE when $I$ is extremely large.

Fig. 7: Illustration of the averaged performance of COBE after projecting the
$I$-dimensional observations onto a lower $I_p$-dimensional space over 50
Monte-Carlo runs.
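
The projection step itself is a single matrix product per block, as the sketch below shows. It assumes that the same Gaussian matrix is applied to every block, so that a component shared by all blocks before projection is still shared afterwards; the helper name `project_blocks` and the toy sizes are hypothetical, and how the common basis in the original space is recovered afterwards (Section 2.6) is not reproduced here.

```python
import numpy as np

def project_blocks(blocks, I_p, rng):
    """Multiply every I x J_n block by one I_p x I standard normal matrix P."""
    I = blocks[0].shape[0]
    P = rng.standard_normal((I_p, I))
    return [P @ Y for Y in blocks], P

rng = np.random.default_rng(0)
I, I_p, J, c = 2000, 100, 30, 2
A_bar = rng.standard_normal((I, c))                       # columns shared by all blocks
coeffs = [rng.standard_normal((c, J)) for _ in range(3)]
blocks = [A_bar @ B + 0.1 * rng.standard_normal((I, J)) for B in coeffs]

projected, P = project_blocks(blocks, I_p, rng)
# By linearity, the shared part of each projected block is (P @ A_bar) @ B.
shared_after = [(P @ A_bar) @ B for B in coeffs]
```

Subsequent computations then work with matrices that have `I_p` rows instead of `I`, which is where the saving in running time comes from.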



Dual-energy X-ray image decomposition. Accurate detection of lung nodules
using dual-energy chest X-ray imaging is an important diagnostic task for
finding early signs of lung cancer [33]. Unfortunately, the presence of ribs
and clavicles overlapping with soft tissues, together with environmental
noise, makes it quite challenging to detect subtle nodules. Accurate
separation of bone from soft tissues is quite helpful for making a correct
diagnosis. In this experiment, we assumed that we had a series of X-ray
images which were mixtures of soft and bone tissues and noise. The mixed soft
and bone tissues formed their nonnegative common components. Our aim was to
extract the separated soft and bone tissues. We generated four sets of
sources whose first two common components were respectively the soft and bone
tissues and whose other eight components were drawn from independent uniform
distributions between 0 and 1 to model interference. They were mixed via
different mixing matrices whose elements were drawn from independent uniform
distributions between 0 and 1. It is known that the sources in this example
are highly correlated and consequently cannot be separated by ICA methods.
Due to the presence of random dense noise, they are also difficult to
separate by applying ordinary NMF algorithms to each single set of mixtures.
As the soft and bone tissues existed in all images, we ran COBE to extract
the basis of the common sources and then used CNFE to extract the soft and
bone tissues. One typical realization is shown in Fig. 8(b). Fig. 8(d)
displays four samples of nonnegative components extracted by the nLCA-IVM
method [33]. Due to the presence of dense noise (so that the identifiability
conditions of nLCA-IVM are not satisfied here), nLCA-IVM cannot extract the
desired source images in this example. This experiment shows how the proposed
method can be used to extract common nonnegative features or, equivalently,
as a nonnegative high correlation analysis.

Fig. 8: Illustration of common nonnegative feature extraction. (a) The
sources. (b) The images extracted by CNFE. (c) Samples of the
observations/mixtures. (d) Partial images extracted by nLCA-IVM.

Human face clustering. In this experiment we applied our method to human face
image clustering. Three data sets were tested:

Yale Database (footnote 4). The Yale face database contains 165 grayscale
images of 15 individuals. There are 11 images per subject, taken with
different facial expressions (happy, sad, sleepy, etc.) or configurations
(lighting, with/without glasses, etc.).

ORL Database (footnote 5). The AT&T ORL database consists of 400 grayscale
images of 40 distinct persons. Each person has ten different images taken at
different times and with different lighting, facial expressions (open/closed
eyes, smiling/not smiling), and facial details (glasses/no glasses). All the
images were taken against a dark homogeneous background with the subjects in
an upright, frontal position.

PIE Database (footnote 6). The CMU PIE face database is a collection of face
images of 68 persons taken under different poses, illumination conditions,
and expressions. Here we used the pre-processed version considered in [34],
which consists of 2856 full frontal grayscale face images taken at the pose
c27.

In the experiments, all images were re-scaled to the size of 32 x 32. We
randomly selected $K$ clusters in each run, and repeated this 50 times for
each $K$. In each run, we first permuted the images randomly and then split
them into $K$ groups, which formed the multi-block data $\mathbf{Y}_k$,
$k = 1, 2, \ldots, K$ (each group consisted of face images from unknown
different clusters). We used COBE to extract the common features and then
used CNFE to obtain nonnegative common features. The number of common
components was specified as 2 in all experiments. Finally, two t-SNE
components of the individual parts $\breve{\mathbf{Y}}_k$ were used to
cluster the data by $K$-means (see [35] for the t-SNE method). As $K$-means
is prone to be influenced by the initial cluster centers, we replicated
$K$-means 20 times in each run. Two widely used performance indices, Accuracy
(%) and Normalized Mutual Information (NMI), were adopted to evaluate the
clustering results; see [34] for their definitions. The proposed method was
compared with PCA with 50 principal components, GNMF [34], and the improved
MinMax Cut (MMCut) method [36]. To verify that the performance of the
proposed method was not entirely due to t-SNE, two t-SNE components of the
original data were also used as features for clustering. The clustering
performance of these algorithms is detailed in TABLE 2, 3, and 4,
respectively. From these tables, after removing the common features existing
in all instances, the clustering performance improves on average on all three
data sets, most markedly on ORL and PIE.

4. [Online]: http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

5. [Online]: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

6. [Online]: http://vasc.ri.cmu.edu/idb/html/face/.

TABLE 2: Clustering Performance on Yale

Accuracy (%)
k      PCA        tSNE       GNMF       MMCut      CIFA
11     40.5±0.2   47.1±4.0   28.8±0.7   45.4±0.5   51.7±3.9
12     49.8±1.2   48.4±3.8   29.5±0.2   47.8±0.3   49.9±2.8
13     49.5±0.9   50.0±3.4   29.4±0.1   43.3±0.1   49.8±2.7
14     46.2±0.6   48.4±2.9   28.6±0.4   42.9±0.5   47.3±2.6
15     40.1±0.8   45.3±2.3   27.2±0.4   42.4±0.0   45.6±2.6
Avg.   45.2       47.8       28.7       44.4       48.9

Normalized Mutual Information (%)
k      PCA        tSNE       GNMF       MMCut      CIFA
11     45.0±0.8   48.9±3.1   26.6±0.4   48.1±0.7   54.7±3.8
12     53.9±1.0   50.9±2.9   27.1±0.1   47.9±0.6   53.1±2.3
13     56.2±1.1   52.2±2.4   29.9±0.2   47.1±0.1   54.0±2.4
14     52.4±0.5   51.8±2.1   32.4±0.0   48.3±0.6   53.8±2.2
15     48.3±0.9   51.1±1.8   31.7±0.4   50.8±0.1   52.7±2.2
Avg.   51.2       51.0       29.5       48.4       53.7

TABLE 3: Clustering Performance on ORL

Accuracy (%)
k      PCA        tSNE       GNMF       MMCut      CIFA
20     69.0±0.1   69.7±3.7   53.5±0.0   64.8±1.8   70.3±3.4
25     64.8±0.3   67.6±3.1   55.1±0.7   65.3±0.7   76.8±3.5
30     67.9±0.8   67.1±2.7   51.0±0.1   71.2±0.8   77.6±3.0
35     62.8±0.2   63.9±2.1   49.2±0.4   68.8±0.5   75.1±2.6
40     58.9±0.4   61.9±1.9   45.6±1.0   64.8±0.5   74.9±2.6
Avg.   64.7       66.0       50.9       67.0       75.0

Normalized Mutual Information (%)
k      PCA        tSNE       GNMF       MMCut      CIFA
20     78.3±0.1   80.5±1.9   65.2±0.5   75.1±0.9   81.1±2.1
25     79.6±0.3   82.0±1.4   70.3±0.5   77.8±0.1   86.2±2.0
30     79.5±0.4   80.9±1.0   69.5±0.2   80.7±0.4   87.1±1.4
35     78.7±0.3   80.3±0.9   66.7±0.6   79.2±0.0   86.2±1.2
40     77.1±0.1   79.9±0.9   66.7±0.7   77.9±0.3   86.6±1.1
Avg.   78.6       80.7       67.7       78.2       85.5

TABLE 4: Clustering Performance on PIE

Accuracy (%)
k      PCA        tSNE       GNMF       MMCut      CIFA
30     21.2±0.4   30.5±1.2   71.8±0.6   62.2±0.2   88.1±6.2
40     20.4±0.2   30.6±1.1   84.0±1.0   65.1±0.4   86.1±2.5
50     20.0±0.0   29.2±1.2   75.4±0.3   59.8±0.0   85.9±1.9
60     19.6±0.0   28.9±1.1   68.5±0.2   54.5±0.4   85.9±2.5
68     19.1±0.2   28.2±1.0   74.0±0.4   60.5±0.4   85.6±2.0
Avg.   20.1       29.5       74.7       60.4       86.3

Normalized Mutual Information (%)
k      PCA        tSNE       GNMF       MMCut      CIFA
30     33.3±0.5   51.6±0.8   85.5±0.6   77.7±0.1   95.4±4.1
40     36.9±0.1   55.6±0.8   90.3±0.2   79.6±0.0   95.1±0.8
50     39.8±0.1   57.0±0.6   89.0±0.0   78.7±0.2   95.2±0.7
60     40.6±0.0   57.4±0.8   87.0±0.0   78.5±0.1   95.1±0.8
68     40.1±0.1   57.8±0.6   87.9±0.1   79.2±0.2   95.0±0.7
Avg.   38.1       55.9       87.9       78.7       95.2
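
For readers who want to reproduce the NMI index, a minimal standard-library sketch is given below. It assumes the normalization by the larger of the two label entropies, which is one common convention; the exact definition used in the experiments is the one in [34] and may differ. The function names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((m / n) * log(m / n) for m in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((m / n) * log(n * m / (pa[x] * pb[y])) for (x, y), m in pab.items())

def nmi(a, b):
    """MI divided by max(H(a), H(b)); assumption: other normalizations exist."""
    h = max(entropy(a), entropy(b))
    # Convention chosen here: two trivial (single-cluster) labelings agree fully.
    return mutual_information(a, b) / h if h > 0 else 1.0

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])     # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])        # statistically independent
```

The index is invariant to a renaming of the cluster labels, which is why no label matching step is needed, unlike for the accuracy index.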

We investigated how the parameter $c$ influenced the clustering performance,
see Fig. 9. We varied $c$ from 1 to 10. For the Yale and ORL data sets, the
COBE method achieved the best results for $c = 2$. For PIE, the performance
is relatively stable for $c > 2$. How to set the optimal parameter $c$
blindly remains challenging in practical applications. Fortunately, it seems
that the degradation caused by overestimating $c$ is mostly not severe. In
fact,

$$\breve{\mathbf{Y}}_n = \mathbf{Y}_n - \sum_{k=1}^{c} \bar{\mathbf{a}}_k \bar{\mathbf{b}}_{n,k}^T = \mathbf{Y}_n - \sum_{k=1}^{c} \bar{\mathbf{a}}_k (\bar{\mathbf{a}}_k^T \mathbf{Y}_n),$$

where $\bar{\mathbf{a}}_k$ and $\bar{\mathbf{b}}_{n,k}$ are the $k$th columns
of $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}_n$, respectively. When $c$ is
overestimated, the correlations between $\bar{\mathbf{a}}_c$ and the
components in $\mathbf{Y}_n$ are quite small (ideally close to 0). In this
case $\bar{\mathbf{b}}_{n,c}^T = \bar{\mathbf{a}}_c^T \mathbf{Y}_n$, i.e.,
the projection of $\mathbf{Y}_n$ onto $\bar{\mathbf{a}}_c$, becomes very
small. In other words, the loss of individual features tends to be very small
when $c$ is overestimated.
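
The size of this loss can be written out explicitly. The step below is a short elaboration rather than part of the original derivation: the superscript $(c)$, denoting the individual part obtained when $c$ common components are removed, is notation introduced here only for illustration, and $\bar{\mathbf{a}}_c$ is assumed to have unit norm, as a column of an orthonormal basis.

```latex
\breve{\mathbf{Y}}_n^{(c)} = \breve{\mathbf{Y}}_n^{(c-1)} - \bar{\mathbf{a}}_c \bar{\mathbf{a}}_c^T \mathbf{Y}_n
\quad\Longrightarrow\quad
\bigl\| \breve{\mathbf{Y}}_n^{(c-1)} - \breve{\mathbf{Y}}_n^{(c)} \bigr\|_F
= \|\bar{\mathbf{a}}_c\|_2 \, \|\bar{\mathbf{a}}_c^T \mathbf{Y}_n\|_2
= \|\bar{\mathbf{b}}_{n,c}\|_2 .
```

Hence the extra amount removed by one superfluous component is exactly the norm of the corresponding projection coefficients, which is small whenever $\bar{\mathbf{a}}_c$ is nearly uncorrelated with the columns of $\mathbf{Y}_n$.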

Applications in classification. We applied the proposed method described in
Fig. 4 to classification problems. Two databases were tested:

The Extended Yale Face Database B. The database has 38 individuals and around
64 near-frontal images under different illuminations per individual [37].
Here we simply used the cropped images analyzed in [34], [38] (footnote 7),
where each image was re-scaled to the size of 32 x 32.

The ETH-80 Database (footnote 8). The ETH-80 database consists of a total of
3280 images of 8 categories, each of which contains 10 objects with 41 views
per object, spaced equally over the viewing hemisphere [39]. Note that each
category contains 10 different objects (although these objects belong to the
same category and share some common features, they have individual features
that differ from those of the other objects in the same category), which
makes this database widely adopted for evaluating classification methods. See
Fig. 10 for the 8 categories in ETH-80 and the 10 objects in the fourth
category.

7. We used the version provided at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.

8. Available at http://www.d2.mpi-inf.mpg.de/sites/default/files/datasets/eth80/eth80-cropped-close128.tgz

Fig. 9: Illustration of how the parameter $c$ influences the clustering
performance for each data set.

Fig. 10: The ETH-80 database. (b) The 10 objects in the fourth category.

Fig. 11: Mean values and standard deviations of the accuracy in the
classification of (a) the extended Yale human face database B and (b) the
ETH-80 database over 20 random runs. The proposed method achieved the best
classification accuracy among them.



We compared our method with three methods: the K-nearest neighbor (KNN)
classifier (included in MATLAB 2010b), the SVM classifier [40], and linear
discriminant analysis (LDA). For the KNN classifier we used the 5 nearest
neighbors. For SVM, we used the 5-fold cross-validation mode, where the best
parameters $c$ and $g$ were found by grid search, for $c$ over an interval
with upper bound 2 and step 0.1, and for $g$ over the interval from $10^{-2}$
to $10^{-1}$ with step $10^{-2}$ (however, SVM was absent from the comparison
on the ETH-80 database as its running time was almost unaffordable). Because
LDA requires the number of samples to be sufficiently larger than the number
of features (to ensure that the covariance matrix is positive definite), we
used 50 principal components as the features for training and classification.
In each run, we randomly selected a certain percentage of the samples as the
training data and used the remainder as the test data. In our classification
routine based on common feature analysis, we split the training samples
belonging to each class into two subgroups and then used COBEc to extract
their common features, where $c = \min_n J_n \times 80\%$. For the ETH-80
database, because each image has the size of 128 x 128 x 3, the size of each
sample is 49,152. Hence we used the projection method described in
Section 2.6 to improve the efficiency of the COBE method by setting
$I_p = 1000$. Then we adopted correlation as the matching score to classify a
new test sample (see Fig. 4 for details). The mean value and the standard
deviation of the classification accuracy over 20 random runs are plotted in
Fig. 11. From the figure, COBEc is robust and provides the best
classification accuracy among the compared methods.

6 Conclusions and Future Work 

A new scheme of common and individual feature analysis (CIFA) for naturally
linked multi-block data was proposed in this paper. First, two new efficient
algorithms were proposed to extract the common orthogonal basis, according to
whether the number of common features is known or not. Then feature
extraction was performed on the common and the individual spaces of the data,
respectively. We investigated how the proposed CIFA scheme can be applied to
improve the performance of classification and clustering tasks by exploiting
the separated common and individual features, respectively. Finally,
extensive simulations on synthetic data showed that the proposed methods are
able to extract common features existing in multi-block data efficiently and
accurately, and experiments on real data showed that the proposed CIFA has
quite promising applications in classification and clustering tasks.

In this study we concentrated on developing a general scheme of common and
individual feature analysis. Some questions remain to be investigated in the
future:

1) The number of common features, i.e., $c$ (or equivalently, the parameter
$\epsilon$), often plays quite an important role in practical applications.
The SORTE method may be a good choice, but it only works for Gaussian noise
with relatively high SNR. How to find the optimal parameter theoretically
deserves further study.

2) For some practical applications, we have to split the data into subgroups
manually to discover their common features. How to group the data is also
quite important for achieving better performance.

3) We gave an initial discussion of how the separated common features can be
used for classification tasks. By splitting the common and individual
features of objects, we think the proposed method may be tailored to more
structurally complex classification tasks by incorporating suitable
dimensionality reduction and feature selection methods.

4) In this paper we only considered common features in one dimension.
Nowadays high-dimensional data are more and more common. How to extend the
proposed method to high-dimensional data and extract common and individual
features from multiple dimensions rather than only one should be quite
interesting and promising. The PVD method in [3] is pioneering work in this
direction. However, it requires that the multi-block data be of the same
size, which is restrictive. More general and flexible CIFA tools for
high-dimensional tensor data are desired.